{ "cells": [ { "cell_type": "markdown", "id": "5d06b060", "metadata": {}, "source": [ "# Pandas\n", "\n", "## Introduction\n", "Pandas is an open-source python library which is bascially used for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.\n", "\n", "### Installation using pip" ] }, { "cell_type": "code", "execution_count": 1, "id": "0db7ccfe", "metadata": {}, "outputs": [], "source": [ "# !pip install pandas" ] }, { "cell_type": "markdown", "id": "ec9bbd19", "metadata": {}, "source": [ "We can use pandas by importing it in the file like:" ] }, { "cell_type": "code", "execution_count": 2, "id": "c7792003", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "id": "231658a9", "metadata": {}, "source": [ "### Creating data\n", "\n", "\n", "There are two core concepts in the pandas: **DataFrame** and **Series**\n", "\n", "\n", "#### DataFrame\n", "\n", "A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.\n", "\n", "For example, consider the following simple DataFrame:" ] }, { "cell_type": "code", "execution_count": 3, "id": "01049cbc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
LikesDislikes
013011
11212
\n", "
" ], "text/plain": [ " Likes Dislikes\n", "0 130 11\n", "1 121 2" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame({'Likes': [130, 121], 'Dislikes': [11, 2]})" ] }, { "cell_type": "markdown", "id": "b31e5fb8", "metadata": {}, "source": [ "In this example, the \"0, Likes\" entry has the value of 130. The \"0, Dislikes\" entry has a value of 11, and so on.\n", "\n", "DataFrame entries are not limited to integers. For instance, here's a DataFrame whose values are strings:" ] }, { "cell_type": "code", "execution_count": 4, "id": "142e0642", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AnonymousAnalyst
0I liked it.Looks good.
1It was great!Informative
\n", "
" ], "text/plain": [ " Anonymous Analyst\n", "0 I liked it. Looks good.\n", "1 It was great! Informative" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame({'Anonymous': ['I liked it.', 'It was great!'], 'Analyst': ['Looks good.', 'Informative']})" ] }, { "cell_type": "markdown", "id": "3e20b8d6", "metadata": {}, "source": [ "We are using the `pd.DataFrame()` constructor to generate these DataFrame objects. The syntax for declaring a new one is a dictionary whose keys are the column names (Anonymous and Analyst in this example), and whose values are a list of entries. This is the standard way of constructing a new DataFrame, and the one you are most likely to encounter.\n", "\n", "\n", "The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.\n", "\n", "\n", "The list of row labels used in a DataFrame is known as an **Index**. We can assign values to it by using an `index` parameter in our constructor:" ] }, { "cell_type": "code", "execution_count": 5, "id": "7086d7f3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AnonymousAnalyst
Product AI liked it.Looks good.
Product BIt was great!Informative
\n", "
" ], "text/plain": [ " Anonymous Analyst\n", "Product A I liked it. Looks good.\n", "Product B It was great! Informative" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame({'Anonymous': ['I liked it.', 'It was great!'], \n", " 'Analyst': ['Looks good.', 'Informative']},\n", " index=['Product A', 'Product B'])" ] }, { "cell_type": "markdown", "id": "8d5a6409", "metadata": {}, "source": [ "#### Series\n", "\n", "A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list:" ] }, { "cell_type": "code", "execution_count": 6, "id": "2cb5cff6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 4\n", "4 5\n", "dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series([1, 2, 3, 4, 5])" ] }, { "cell_type": "markdown", "id": "9cbecad5", "metadata": {}, "source": [ "A Series is, in essence, a single column of a DataFrame. So you can assign column values to the Series the same way as before, using an `index` parameter. However, a Series does not have a column name, it only has one overall `name`:" ] }, { "cell_type": "code", "execution_count": 7, "id": "07354f95", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2017 Sales 400\n", "2018 Sales 515\n", "2019 Sales 605\n", "Name: Product A, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series([400, 515, 605], index=['2017 Sales', '2018 Sales', '2019 Sales'], name='Product A')" ] }, { "cell_type": "markdown", "id": "7d045663", "metadata": {}, "source": [ "### Reading Data Files\n", "\n", "Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won't actually be creating our own data by hand. Instead, we'll be working with data that already exists.\n", "\n", "Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:\n", "\n", "***\n", "Product A,Product B,Product C, \n", "\n", "30,21,9, \n", "\n", "35,34,1, \n", "\n", "41,11,11\n", "\n", "***\n", "\n", "So a CSV file is a table of values separated by commas. Hence the name: \"Comma-Separated Values\", or CSV.\n", "\n", "\n", "Let's now try to read a very famous dataset, known as **Golf Dataset**. We'll use the `pd.read_csv()` function to read the data into a DataFrame. This goes thusly:" ] }, { "cell_type": "code", "execution_count": 8, "id": "70962998", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OutlookTempHumidityWindyPlay Golf
0RainyHotHighFalseNo
1RainyHotHighTrueNo
2OvercastHotHighFalseYes
3SunnyMildHighFalseYes
4SunnyCoolNormalFalseYes
5SunnyCoolNormalTrueNo
6OvercastCoolNormalTrueYes
7RainyMildHighFalseNo
8RainyCoolNormalFalseYes
9SunnyMildNormalFalseYes
10RainyMildNormalTrueYes
11OvercastMildHighTrueYes
12OvercastHotNormalFalseYes
13SunnyMildHighTrueNo
\n", "
" ], "text/plain": [ " Outlook Temp Humidity Windy Play Golf\n", "0 Rainy Hot High False No\n", "1 Rainy Hot High True No\n", "2 Overcast Hot High False Yes\n", "3 Sunny Mild High False Yes\n", "4 Sunny Cool Normal False Yes\n", "5 Sunny Cool Normal True No\n", "6 Overcast Cool Normal True Yes\n", "7 Rainy Mild High False No\n", "8 Rainy Cool Normal False Yes\n", "9 Sunny Mild Normal False Yes\n", "10 Rainy Mild Normal True Yes\n", "11 Overcast Mild High True Yes\n", "12 Overcast Hot Normal False Yes\n", "13 Sunny Mild High True No" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "golf_dataset = pd.read_csv(\"./Data/Pandas/golf-dataset.csv\")\n", "\n", "golf_dataset" ] }, { "cell_type": "markdown", "id": "136f0352", "metadata": {}, "source": [ "##### ```shape```\n", "\n", "We can use the `shape` attribute to check how large the resulting DataFrame is:" ] }, { "cell_type": "code", "execution_count": 9, "id": "2d5d6e44", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(14, 5)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "golf_dataset.shape" ] }, { "cell_type": "markdown", "id": "25729775", "metadata": {}, "source": [ "##### ```head```\n", "\n", "We can examine the contents of the resultant DataFrame using the head() command, which grabs the first five rows:" ] }, { "cell_type": "code", "execution_count": 10, "id": "a345fcf8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OutlookTempHumidityWindyPlay Golf
0RainyHotHighFalseNo
1RainyHotHighTrueNo
2OvercastHotHighFalseYes
3SunnyMildHighFalseYes
4SunnyCoolNormalFalseYes
\n", "
" ], "text/plain": [ " Outlook Temp Humidity Windy Play Golf\n", "0 Rainy Hot High False No\n", "1 Rainy Hot High True No\n", "2 Overcast Hot High False Yes\n", "3 Sunny Mild High False Yes\n", "4 Sunny Cool Normal False Yes" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "golf_dataset.head()" ] }, { "cell_type": "markdown", "id": "ba611a80", "metadata": {}, "source": [ "In Python, we can access the property of an object by accessing it as an attribute. A `book` object, for example, might have a `title` property, which we can access by calling `book.title`. Columns in a pandas DataFrame work in much the same way.\n", "\n", "Hence to access the `Temp` property of `golf_dataset` we can use:" ] }, { "cell_type": "code", "execution_count": 11, "id": "f0e68caf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Hot\n", "1 Hot\n", "2 Hot\n", "3 Mild\n", "4 Cool\n", "5 Cool\n", "6 Cool\n", "7 Mild\n", "8 Cool\n", "9 Mild\n", "10 Mild\n", "11 Mild\n", "12 Hot\n", "13 Mild\n", "Name: Temp, dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "golf_dataset.Temp" ] }, { "cell_type": "markdown", "id": "72a9fd1d", "metadata": {}, "source": [ "### Indexing in Pandas\n", "\n", "The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, loc and iloc. For more advanced operations, these are the ones you're supposed to be using.\n", "\n", "#### Index-based Selection\n", "\n", "Pandas indexing works in one of two paradigms. The first is **index-based selection**: selecting data based on its numerical position in the data. `iloc` follows this paradigm.\n", "\n", "To select the first row of data in a DataFrame, we may use the following:" ] }, { "cell_type": "code", "execution_count": 12, "id": "9177287e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Outlook Rainy\n", "Temp Hot\n", "Humidity High\n", "Windy False\n", "Play Golf No\n", "Name: 0, dtype: object" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "golf_dataset.iloc[0]" ] }, { "cell_type": "markdown", "id": "21c3c586", "metadata": {}, "source": [ "Both `loc` and `iloc` are row-first, column-second. This is the opposite of what we do in native Python, which is column-first, row-second.\n", "\n", "This means that it's marginally easier to retrieve rows, and marginally harder to get retrieve columns. To get a column with `iloc`, we can do the following:" ] }, { "cell_type": "code", "execution_count": 13, "id": "86769edd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Rainy\n", "1 Rainy\n", "2 Overcast\n", "3 Sunny\n", "4 Sunny\n", "5 Sunny\n", "6 Overcast\n", "7 Rainy\n", "8 Rainy\n", "9 Sunny\n", "10 Rainy\n", "11 Overcast\n", "12 Overcast\n", "13 Sunny\n", "Name: Outlook, dtype: object" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "golf_dataset.iloc[:,0]" ] }, { "cell_type": "markdown", "id": "5d38b616", "metadata": {}, "source": [ "#### Label-based Selection\n", "\n", "The second paradigm for attribute selection is the one followed by the `loc` operator: **label-based selection**. 
{ "cell_type": "markdown", "id": "5d38b616", "metadata": {}, "source": [ "#### Label-based Selection\n", "\n", "The second paradigm for selection is the one followed by the `loc` operator: **label-based selection**. In this paradigm, it's the data index value, not its position, that matters.\n", "\n", "For example, to get the first entry in `golf_dataset`, we would now do the following:" ] }, { "cell_type": "code", "execution_count": 14, "id": "a75834ca", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Rainy'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "golf_dataset.loc[0, 'Outlook']" ] }, { "cell_type": "markdown", "id": "2438c75f", "metadata": {}, "source": [ "`iloc` is conceptually simpler than `loc` because it ignores the dataset's indices. When we use `iloc` we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. `loc`, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it's usually easier to do things using `loc` instead.\n", "\n", "For example, here's one operation that's much easier using `loc`:" ] }, { "cell_type": "code", "execution_count": 15, "id": "c2f3fba0", "metadata": {}, "outputs": [ { "data": {
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OutlookTemp
0RainyHot
1RainyHot
2OvercastHot
3SunnyMild
\n", "
" ], "text/plain": [ " Outlook Temp\n", "0 Rainy Hot\n", "1 Rainy Hot\n", "2 Overcast Hot\n", "3 Sunny Mild" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "golf_dataset.loc[:3, ['Outlook', 'Temp']]" ] }, { "cell_type": "markdown", "id": "1035d0cf", "metadata": {}, "source": [ "### Summary functions in Pandas\n", "\n", "Pandas provides many simple \"summary functions\" (not an official name) which restructure the data in some useful way. For example, consider the `describe()` method:" ] }, { "cell_type": "code", "execution_count": 16, "id": "53b9cdcd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OutlookTempHumidityWindyPlay Golf
count1414141414
unique33222
topRainyMildHighFalseYes
freq56789
\n", "
" ], "text/plain": [ " Outlook Temp Humidity Windy Play Golf\n", "count 14 14 14 14 14\n", "unique 3 3 2 2 2\n", "top Rainy Mild High False Yes\n", "freq 5 6 7 8 9" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "golf_dataset.describe()" ] }, { "cell_type": "markdown", "id": "1ccf6afb", "metadata": {}, "source": [ "or we can find out summary for an individual attribute like for `Temp`:" ] }, { "cell_type": "code", "execution_count": 17, "id": "87a3bc6b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 14\n", "unique 3\n", "top Mild\n", "freq 6\n", "Name: Temp, dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "golf_dataset.Temp.describe()" ] }, { "cell_type": "markdown", "id": "2313f7e9", "metadata": {}, "source": [ "To see a list of unique values we can use the `unique()` function:" ] }, { "cell_type": "code", "execution_count": 18, "id": "9449c5ba", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Hot', 'Mild', 'Cool'], dtype=object)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "golf_dataset.Temp.unique()" ] }, { "cell_type": "markdown", "id": "3517d7b7", "metadata": {}, "source": [ "### Dtypes" ] }, { "cell_type": "markdown", "id": "5ec6317e", "metadata": {}, "source": [ "The data type for a column in a DataFrame or a Series is known as the **dtype**.\n", "\n", "You can use the `dtype` property to grab the type of a specific column. For instance, we can get the dtype of the `Windy` column in the `golf_dataset` DataFrame:" ] }, { "cell_type": "code", "execution_count": 19, "id": "36bd64f0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dtype('bool')" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "golf_dataset.Windy.dtype" ] }, { "cell_type": "markdown", "id": "86d79071", "metadata": {}, "source": [ "Alternatively, the `dtypes` property returns the dtype of every column in the DataFrame" ] }, { "cell_type": "code", "execution_count": 20, "id": "171ce1c6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Outlook object\n", "Temp object\n", "Humidity object\n", "Windy bool\n", "Play Golf object\n", "dtype: object" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "golf_dataset.dtypes" ] }, { "cell_type": "markdown", "id": "443cc4ac", "metadata": {}, "source": [ "## Further Readings\n", "\n", "We've seen the most commonly used functions of Pandas that are used in the field of Machine Learning and Data Analysis. However there is a variety of other functions that Pandas provides us.\n", "\n", "For further understanding, or to explore more you may refer to the official documentation of Panadas at:\n", "\n", "https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }